Overview

This document will guide you through a few data analysis and model fitting tasks.

Below, I provide commentary and instructions, and you are expected to write all or some of the missing code to perform the steps I describe.

Note that I call the main data variable d. So if you see bits of code with that variable, it is the name of the data. You are welcome to give it different names, then just adjust the code snippets accordingly.

knitr::opts_chunk$set(error = TRUE, warning=FALSE)

Project setup

We need a variety of different packages, which are loaded here. Install as needed. If you use others, load them here.

library('tidyr')
library('dplyr')
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library('forcats')
library('ggplot2')
library('corrplot') #to make a correlation plot. You can use other options/packages.
## corrplot 0.84 loaded
library('visdat') #for missing data visualization
library('caret') #for model fitting
## Loading required package: lattice
## Error: package or namespace load failed for 'caret' in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
##  namespace 'rlang' 0.3.4 is already loaded, but >= 0.4.0 is required
library('earth')
## Error in library("earth"): there is no package called 'earth'
library('reshape')
## 
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
## 
##     rename
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
library('Metrics')
## Error in library("Metrics"): there is no package called 'Metrics'

Data loading

We will be exploring and fitting a dataset of norovirus outbreaks. You can look at the codebook, which briefly explains the meaning of each variable. If you are curious, you can check some previous papers that we published using (slighly different versions of) this dataset here and here.

#Write code that loads the dataset and does a quick check to make sure the data loaded ok (using e.g. `str` and `summary` or similar such functions).

data_raw = read.csv("./norodata.csv")
## Error in file(file, "rt"): cannot open the connection
names(data_raw)
## Error in eval(expr, envir, enclos): object 'data_raw' not found
summary(data_raw)
## Error in summary(data_raw): object 'data_raw' not found

Data exploration and cleaning

Investigating the outcome of interest

Let’s assume that our main outcome of interest is the fraction of individuals that become infected in a given outbreak. The data reports that outcome (called RateAll), but we’ll also compute it ourselves so that we can practice creating new variables. To do so, take a look at the data (maybe peek at the Codebook) and decide which of the existing variables you should use to compute the new one. This new outcome variable will be added to the data frame.

In the code sheet, RateAll is described as: Attack rate (i.e. (secondary cases + primary cases/persons at risk)*100) from outbreak. Must be described in article as a “overall attack rate”, and can NOT be calculated by reviewer (i.e. if calculations are required to discern the number of secondary cases and/or persons at risk, the attack rate can NOT be calculated by the reviewer). If this information is not available, leave the field blank.

We will use the following variables from the dataset: CasesAll and RiskAll

#fraction of individuals infected in a given outbreak: secondary cases + primary cases/persons at risk)*100
#note: the RateAll was reported by the authors, and we are checking these calculations by doing them by hand. We may calculate and attack rate that was not included in the primary literature.

# Use the `mutate()` function from the `dplyr` package to create a new column with this value. Call the new variable `fracinf`.
#d <- data_raw %>% dplyr::mutate(FILL HERE)

d = data_raw %>% dplyr::mutate(fracinf = CasesAll/RiskAll*100)
## Error in eval(lhs, parent, parent): object 'data_raw' not found
#What if the varibles had "NA"?

:::note Note the notation dplyr:: in front of mutate. This is not strictly necessary, but it helps in 2 ways. First, this tells the reader explicitly from which package the function comes. This is useful for quickly looking at the help file of the function, or if we want to adjust which packages are loaded/used. It also avoids occasional confusion if a function exists more than once (e.g. filter exists both in the stats and dplyr package). If the package is not specified, R takes the function from the package that was loaded last. This can sometimes produce strange error messages. I thus often (but not always) write the package name in front of the function. :::

:::fyi As you see in the Rmd file, the previous text box is created by placing texts between the ::: symbols and specifying some name. This allows you to apply your own styling to specific parts of the text. You define your style in a css file (here called customstyles.css), and you need to list that file in the _site.yml file. The latter file also lets you change the overall theme. You can choose from the library of free Bootswatch themes. :::

Use both text summaries and plots to take a look at the new variable you created to see if everything looks ok or if we need further cleaning.

#Write code that takes a look at the values of the `fracinf` variable you created. Look at both text summaries and a figure.

class(d$fracinf)
## Error in eval(expr, envir, enclos): object 'd' not found
head(d$fracinf)
## Error in head(d$fracinf): object 'd' not found
summary(d$fracinf)
## Error in summary(d$fracinf): object 'd' not found
d %>%
  filter(fracinf != "NA") %>%
  ggplot(aes(x = fracinf)) + geom_histogram(binwidth = 10)
## Error in eval(lhs, parent, parent): object 'd' not found

We notice there are NAs in this variable and the distribution is not normal. The latter is somewhat expected since our variable is a proportion, so it has to be between 0 and 1. There are also a lot of infinite values. Understand where they come from.

Our first observation has an infinate value.

d$CasesAll[1]
## Error in eval(expr, envir, enclos): object 'd' not found
d$RiskAll[1]
## Error in eval(expr, envir, enclos): object 'd' not found
#In this example, the "at risk" population is 0. Since the "Risk All" is the denominator, then we have an infinate value reported here for the fracinf value.

Let’s take a look at the RateAll variable recorded in the dataset and compare it to ours. First, create a plot that lets you quickly see if/how the variables differ.

# Plot one variable on the x axis, the other on the y axis


d %>%
  filter(fracinf != "NA") %>%
  ggplot(aes(x = fracinf, y = RateAll)) + geom_point()
## Error in eval(lhs, parent, parent): object 'd' not found
# also plot the difference of the 2 variables
# make sure you adjust so both are in the same units

#d %>% filter(fracinf != "NA") %>%
  #ggplot(aes(x = fracinf, y = fracinf/RateAll) + 
  #geom_point()

Both ways of plotting the data show that for most outbreaks, the two ways of getting the outcome agree. So that’s good. But we need to look closer and resolve the problem with infinite values above. Check to see what the RateAll variable has for those infinite values.

#Write code that looks at the values of RateAll where we have inifinite values

infvalues_index = which(d$fracinf == "Inf")
## Error in which(d$fracinf == "Inf"): object 'd' not found
infvalues_risk = d$RiskAll[infvalues_index]
## Error in eval(expr, envir, enclos): object 'd' not found
#In this example, the "at risk" population is 0. Since the "Risk All" is the denominator, then we have an infinate value reported here for the fracinf value.

You should find that all of the reported values are 0. So what makes more sense? You should have figured out that the infinite values in our computed variables arise because the RiskAll variable is 0. That variable contains the total number of persons at risk for an outbreak. If nobody is at risk of getting infected, of course, we can’t get any infected. So RateAll being 0 is technically correct. But does it make sense to include “outbreaks” in our analysis where nobody is at risk of getting infected? One should question how those got into the spreadsheet in the first place.

Having to deal with “weirdness” in your data like this example is common. You often need to make a decision based on best judgment.

Here, I think that if nobody is at risk, we shouldn’t include those outbreaks in further analysis. Thus, we’ll go with our computed outcome and remove all observations that have missing or infinite values for the outcome of interest, since those can’t be used for model fitting. Thus, we go ahead and remove any observations that have un-useable values in the outcome.

#Write code that removes all observations that have an outcome that is not very useful, i.e. either NA or infinity. Then look at the outcome variable again to make sure things are fixed. Also check the size of the new dataset to see by how much it shrunk.

d = d %>%
  filter(fracinf != "NA") %>%
  filter(fracinf != "Inf")
## Error in eval(lhs, parent, parent): object 'd' not found
#When filtering out the observations with "NA" and "Inf" for the fracinf value, we are left with only 579 observations of the original 1022. 

You should find that we lost a lot of data, we are down to 579 observations (from a starting 1022). That would be troublesome for most studies if that would mean subjects drop out (that could lead to bias). Here it’s maybe less problematic since each observation is an outbreak collected from the literature. Still, dropping this many could lead to bias if all the ones that had NA or Infinity were somehow systematically different. It would be useful to look into and discuss in a real analysis.

Wrangling the predictors

Not uncommon for real datasets, this one has a lot of variables. Many are not too meaningful for modeling. Our question is what predicts the fraction of those that get infected, i.e., the new outcome we just created. We should first narrow down the predictor variables of interest based on scientific grounds.

For this analysis exercise, we just pick the following variables for further analysis: Action1, CasesAll, Category, Country, Deaths, GG2C4, Hemisphere, Hospitalizations, MeanA1, MeanD1, MeanI1, MedianA1, MedianD1, MedianI1, OBYear, Path1, RiskAll, Season, Setting, Trans1, Vomit. Of course, we also need to keep our outcome of interest.

Note that - as often happens for real data - there are inconsistencies between the codebook and the actual datasheet. Here, names of variables and spelling in the codebook do not fully agree with the data. The above list of variables is based on codebook, and you need to make sure you get the right names from the data when selecting those variables.

#write code to select the specified variables

sort(names(d))
## Error in sort(names(d)): object 'd' not found
d_predict = d %>% select(Action1, CasesAll, Category, Country, Deaths, gg2c4, Hemisphere, Hospitalizations, MeanA1, MeanD1, MeanI1, MedianA1, MedianD1, MedianI1, OBYear, Path1, RiskAll, season, Setting_1, Setting_2, Trans1, Vomit, fracinf) 
## Error in eval(lhs, parent, parent): object 'd' not found
names(d_predict)
## Error in eval(expr, envir, enclos): object 'd_predict' not found
#changes made:
#gg2c4
#season
#Setting - Setting_1, Setting_2

#579 Observations of 23 Varibales 

d = d %>% select(Action1, CasesAll, Category, Country, Deaths, gg2c4, Hemisphere, Hospitalizations, MeanA1, MeanD1, MeanI1, MedianA1, MedianD1, MedianI1, OBYear, Path1, RiskAll, season, Setting_1, Setting_2, Trans1, Vomit, fracinf) 
## Error in eval(lhs, parent, parent): object 'd' not found

Your reduced dataset should contain 579 observations and 22 variables.

With this reduced dataset, we’ll likely still need to perform further cleaning. We can start by looking at missing data. While the summary function gives that information, it is somewhat tedious to pull out. We can just focus on NA for each variable and look at the text output, or for lots of predictors, a graphical view is easier to understand. The latter has the advantage of showing potential clustering of missing values.

# this code prints number of missing for each variable (assuming your dataframe is called d)

print(colSums(is.na(d))) 
## Error in is.data.frame(x): object 'd' not found
# write code to use the visdat R package, add code that plots a heatmap of missing values  

vis_dat(d)
## Error in test_if_dataframe(x): object 'd' not found
#We are missing a lot of data for MeanA1 and Median A1. We are missing just a few observations from Death and Hospitalizations

#MeanA1: Mean age of infected persons (primary cases), measured in years
#MedianA1:Median age of infected persons (primary cases), measured in years

vis_miss(d)
## Error in test_if_dataframe(x): object 'd' not found

It looks like we have a lot of missing data for the MeanA1 and MedianA1 variables. If we wanted to keep those variables, we would be left with very few observations. So let’s drop those two variables. After that, we will drop all observations that have missing data (seems to be Hospitalization and Deaths).

# write code to remove the 2 "A1" variables, then drop all remaining observations with NA

d_drop = d %>% select(-MeanA1, -MedianA1)
## Error in eval(lhs, parent, parent): object 'd' not found
vis_miss(d_drop)
## Error in test_if_dataframe(x): object 'd_drop' not found
d = d %>% select(-MeanA1, -MedianA1)
## Error in eval(lhs, parent, parent): object 'd' not found
d_drop = d %>% drop_na()
## Error in eval(lhs, parent, parent): object 'd' not found
vis_miss(d_drop)
## Error in test_if_dataframe(x): object 'd_drop' not found
#Note, this code is going to drop any observations with "NA" in either death or hospitals.

d = d %>% drop_na()
## Error in eval(lhs, parent, parent): object 'd' not found

Let’s now check the format of each variable. Depending on how you loaded the data, some variables might not be in the right format. Make sure everything that should be numeric is numeric/integer, everything that should be a factor is a factor. There should be no variable coded as character. Once all variables have the right format, take a look at the data again.

# write code to format variables as needed
vis_dat(d)
## Error in test_if_dataframe(x): object 'd' not found
#There are no varibales coded as "character"
#OBYear should be numeric

d$OBYear = as.numeric(levels(d$OBYear))[d$OBYear]
## Error in levels(d$OBYear): object 'd' not found
vis_dat(d)
## Error in test_if_dataframe(x): object 'd' not found

Take another look at the data. You should find that for the dataset, most things look reasonable, but the variable Setting_1 has a lot of different levels/values. That many categories, most with only a single entry, will likely not be meaningful for modeling. One option is to drop the variable. But assume we think it’s an important variable to include and we are especially interested in the difference between restaurant settings and other settings. We could then create a new variable that has only two levels, Restaurant and Other.

#write code that creates a new variable called `Setting` based on `Setting_1` but with only 2 levels, `Restaurant` and `Other`. Then remove the `Setting_1` variable. Note that restaurant is sometimes capitalized and sometimes not. You need to fix that first. For these lines of code, the 'Factor' chapter in R4DS might be helpful here.

d_newsetting = d %>% mutate(Setting = fct_recode(Setting_1, "Restaurant" = "restaurant"))
## Error in eval(lhs, parent, parent): object 'd' not found
#note: this line of code will not consider "take-out restaurant", "buffet", or the like as "Restaurant"

d_newsetting$Setting = fct_other(d_newsetting$Setting, keep = c("Restaurant"))
## Error in check_factor(f): object 'd_newsetting' not found
d = d_newsetting %>% select(-Setting_1, -Setting_2)
## Error in eval(lhs, parent, parent): object 'd_newsetting' not found

Data visualization

Next, let’s create a few plots showing the outcome and the predictors.

#write code that produces plots showing our outcome of interest on the y-axis and each numeric predictor on the x-axis.
#you can use the facet_wrap functionality in ggplot for it, or do it some other way.

#p = d %>% ggplot(aes(y = fracinf, x = value)) + 
  #geom_point(aes(value = CasesAll)) +  
  #geom_point(aes(value = Deaths)) +
  #geom_point(aes(value = MeanI1)) + 
  #geom_point(aes(value = MedianI1)) +
  #geom_point(aes(value = OBYear)) + 
  #geom_point(aes(value = Vomit)) + 
  #geom_point(aes(value = Deaths)) + 
  #geom_point(aes(value = MeanD1)) +
  #geom_point(aes(value = MedianD1)) + 
  #geom_point(aes(value = RiskAll)) 
  #facet_wrap(vars(CasesAll, Deaths, MeanI1, MedianI1, OBYear, Vomit, Deaths, MeanD1, MedianD1, RiskAll))

#p + facet_wrap(vars("CasesAll", "Deaths", "MeanI1", "MedianI1", "OBYear", "Vomit", "Deaths", "MeanD1", "MedianD1", "RiskAll"))

d_melted = d %>% select(CasesAll, Deaths, MeanI1, MedianI1, OBYear, Vomit, Deaths, MeanD1, MedianD1, RiskAll, fracinf)
## Error in eval(lhs, parent, parent): object 'd' not found
d_melted =  melt(d_melted, id.vars = "fracinf")
## Error in melt(d_melted, id.vars = "fracinf"): object 'd_melted' not found
d_melted %>% 
  ggplot(aes(x = value, y = fracinf)) + 
  geom_point() + 
  facet_wrap(.~variable, scales = "free_x") 
## Error in eval(lhs, parent, parent): object 'd_melted' not found

One thing I notice in the plots is that there are lots of zeros for many predictors and things look skewed. That’s ok, but means we should probably standardize these predictors. One strange finding (that I could have caught further up when printing the numeric summaries, but didn’t) is that there is (at least) one outbreak that has outbreak year reported as 0. That is, of course, wrong and needs to be fixed. There are different ways of fixing it, the best, of course, would be to trace it back and try to fix it with the right value. We won’t do that here. Instead, we’ll remove that observation.

# write code that figures out which observation(s) have 0 years and remove those from the dataset.

year_zero = which(d$OBYear == 0)
## Error in which(d$OBYear == 0): object 'd' not found
d$OBYear[219]
## Error in eval(expr, envir, enclos): object 'd' not found
d_years = d %>% filter(OBYear != 0)
## Error in eval(lhs, parent, parent): object 'd' not found
year_zero = which(d_years$OBYear == 0)
## Error in which(d_years$OBYear == 0): object 'd_years' not found
d = d %>% filter(OBYear != 0)
## Error in eval(lhs, parent, parent): object 'd' not found
# do some quick check to make sure OByear values are all reasonable now

d %>%
  ggplot(aes(y = fracinf, x = OBYear)) + 
  geom_point()
## Error in eval(lhs, parent, parent): object 'd' not found
year_1800s = which(d$OBYear < 1990)
## Error in which(d$OBYear < 1990): object 'd' not found
d$OBYear[250]
## Error in eval(expr, envir, enclos): object 'd' not found

Another useful check is to see if there are strong correlations between some of the numeric predictors. That might indicate collinearity, and some models can’t handle that very well. In such cases, one might want to remove a predictor. We’ll create a correlation plot of the numeric variables to inspect this.

# using e.g. the corrplot package (or any other you like), create a correlation plot of the numeric variables

d_cor = d %>% select(CasesAll, Deaths, MeanI1, MedianI1, OBYear, Vomit, Deaths, MeanD1, MedianD1, RiskAll, fracinf) %>% cor()
## Error in eval(lhs, parent, parent): object 'd' not found
corrplot(d_cor, method="color")
## Error in corrplot(d_cor, method = "color"): object 'd_cor' not found

It doesn’t look like there are any very strong correlations between continuous variables, so we can keep them all for now.

Next, let’s create plots for the categorical variables, again our main outcome of interest on the y-axis.

#write code that produces plots showing our outcome of interest on the y-axis and each categorical predictor on the x-axis.

#you can use the facet_wrap functionality in ggplot for it, or do it some other way.

d_catvars = d %>% select(-CasesAll, -Deaths, -MeanI1, -MedianI1, -OBYear, -Vomit, -Deaths, -MeanD1, -MedianD1, -RiskAll, fracinf)
## Error in eval(lhs, parent, parent): object 'd' not found
catvars_melted =  melt(d_catvars, id.vars = "fracinf")
## Error in melt(d_catvars, id.vars = "fracinf"): object 'd_catvars' not found
catvars_melted %>%
ggplot(aes(x = value, y = fracinf)) + 
geom_boxplot() + 
facet_wrap(.~variable, scales = "free_x") 
## Error in eval(lhs, parent, parent): object 'catvars_melted' not found

The plots do not look pretty, which is ok for exploratory. We can see that a few variables have categories with very few values (again, something we could have also seen using summary, but graphically it is usually easier to see). This will likely produce problems when we fit using cross-validation, so we should fix that. Options we have:

  • Completely drop those variables if we decide they are not of interest after all.
  • Recode values by combining, like we did above with the Setting variable.
  • Remove observations with those rare values.

Let’s use a mix of these approaches. We’ll drop the Category variable, we’ll remove the observation(s) with Unspecified in the Hemisphere variable, and we’ll combine Unknown with Unspecified for Action1 and Path1 variables.

# write code that implements the cleaning steps described above.

#drop the `Category` variable, remove the observation(s) with `Unspecified` in the `Hemisphere` variable

d_catvars = d_catvars %>% select (-Category) %>% filter(Hemisphere != "Unspecified")
## Error in eval(lhs, parent, parent): object 'd_catvars' not found
#Only one observation where Hemisphere was "Unspecified"

#combine `Unknown` with `Unspecified` for `Action1` and `Path1` variables.

d_catvars$Action1 = fct_recode(d_catvars$Action1, "Unspecified" = "Unknown")
## Error in check_factor(.f): object 'd_catvars' not found
d_catvars$Path1 = fct_recode(d_catvars$Path1, "Unspecified" = "Unknown")
## Error in check_factor(.f): object 'd_catvars' not found
# then check again (e.g. with a plot) to make sure things worked

catvars_melted =  melt(d_catvars, id.vars = "fracinf")
## Error in melt(d_catvars, id.vars = "fracinf"): object 'd_catvars' not found
catvars_melted %>%
ggplot(aes(x = value, y = fracinf)) + 
geom_boxplot() + 
facet_wrap(.~variable, scales = "free_x", ncol = 4) 
## Error in eval(lhs, parent, parent): object 'catvars_melted' not found

At this step, you should have a dataframe containing 551 observations, and 19 variables: 1 outcome, 9 numeric/integer predictors, and 9 factor variables. There should be no missing values.

d = d %>% select (-Category) %>% filter(Hemisphere != "Unspecified")
## Error in eval(lhs, parent, parent): object 'd' not found
d$Action1 = fct_recode(d$Action1, "Unspecified" = "Unknown")
## Error in check_factor(.f): object 'd' not found
d$Path1 = fct_recode(d$Path1, "Unspecified" = "Unknown")
## Error in check_factor(.f): object 'd' not found

Model fitting

We can finally embark on some modeling - or at least we can get ready to do so.

We will use a lot of the caret package functionality for the following tasks. You might find the package website useful as you try to figure things out.

Data splitting

Depending on the data and question, we might want to reserve some of the data for a final validation/testing step or not. Here, to illustrate this process and the idea of reserving some data for the very end, we’ll split things into a train and test set. All the modeling will be done with the train set, and final evaluation of the model(s) happens on the test set. We use the caret package for this.

#this code does the data splitting. I still assume that your data is stored in the `d` object.
#uncomment to run
set.seed(123)
trainset <- caret::createDataPartition(y = d$fracinf, p = 0.7, list = FALSE)
## Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]): namespace 'rlang' 0.3.4 is already loaded, but >= 0.4.0 is required
data_train = d[trainset,] #extract observations/rows for training, assign to new variable
## Error in eval(expr, envir, enclos): object 'd' not found
data_test = d[-trainset,] #do the same for the test set
## Error in eval(expr, envir, enclos): object 'd' not found

:::note Since the above code involves drawing samples, and we want to do that reproducible, we also set a random number seed with set.seed(). With that, each time we perform this sampling, it will be the same, unless we change the seed. If nothing about the code changes, setting the seed once at the beginning is enough. If you want to be extra sure, it is a good idea to set the seed at the beginning of every code chunk that involves random numbers (i.e., sampling or some other stochastic/random procedure). We do that here. :::

A null model

Now let’s begin with the model fitting. We’ll start by looking at a null model, which is just the mean of the data. This is, of course, a stupid “model” but provides some baseline for performance.

#write code that computes the RMSE for a null model, which is just the mean of the outcome

nullModel(y = data_train$fracinf)
## Error in nullModel(y = data_train$fracinf): could not find function "nullModel"
data_train %>%
  summarize(mean = mean(fracinf))
## Error in eval(lhs, parent, parent): object 'data_train' not found
#remember that from now on until the end, everything happens with the training data

Single predictor models

Now we’ll fit the outcome to each predictor one at a time. To evaluate our model performance, we will use cross-validation and the caret package. Note that we just fit a linear model. caret itself is not a model. Instead, it provides an interface that allows easy access to many different models and has functions to do a lot of steps quickly - as you will see below. Most of the time, you can do all our work through the caret (or mlr) workflow. The problem is that because caret calls another package/function, sometimes things are not as clear, especially when you get an error message. So occasionally, if you know you want to use a specific model and want more control over things, you might want to not use caret and instead go straight to the model function (e.g. lm or glm or…). We’ve done a bit of that before, for the remainder of the class we’ll mostly access underlying functions through caret.

#There is probably a nicer tidyverse way of doing this. I just couldn't think of it, so did it this way.
set.seed(1111) #makes each code block reproducible
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5) #setting CV method for caret
## Error in trainControl(method = "repeatedcv", number = 5, repeats = 5): could not find function "trainControl"
Npred <- ncol(data_train)-1 # number of predictors
## Error in ncol(data_train): object 'data_train' not found
resultmat <- data.frame(Variable = names(data_train)[-1], RMSE = rep(0,Npred)) 
## Error in data.frame(Variable = names(data_train)[-1], RMSE = rep(0, Npred)): object 'data_train' not found
    #store values for RMSE for each variable

#reordering the dataset to put the outcome in the first column
data_train <- data_train[,c(18,1:17,19)]
## Error in eval(expr, envir, enclos): object 'data_train' not found
for (n in 2:ncol(data_train)) #loop over each predictor. For this to work, outcome must be in 1st column
{
  fit1 <- train(as.formula(paste("fracinf ~",names(data_train)[n])), 
                 data = data_train, method = "lm", trControl = fitControl) 
 resultmat[n-1,2]= fit1$results$RMSE  
}
## Error in ncol(data_train): object 'data_train' not found
print(resultmat)
## Error in print(resultmat): object 'resultmat' not found

This analysis shows 2 things that might need closer inspections. We get some error/warning messages, and most RMSE of the single-predictor models are not better than the null model. Usually, this is cause for more careful checking until you fully understand what is going on. But for this exercise, let’s blindly press on!

Multi-predictor models

Now let’s perform fitting with multiple predictors. Use the same setup as the code above to fit the outcome to all predictors at the same time. Do that for 3 different models: linear (lm), regression splines (earth), K nearest neighbor (knn). You might have to install/load some extra R packages for that. If that’s the case, caret will tell you.

set.seed(1111) #makes each code block reproducible
#write code that uses the train function in caret to fit the outcome to all predictors using the 3 methods specified.

fitControl <- trainControl(method="repeatedcv",number=5,repeats=5) #setting CV method for caret
## Error in trainControl(method = "repeatedcv", number = 5, repeats = 5): could not find function "trainControl"
linear_model = train(fracinf ~ ., data = data_train, method = "lm", trControl = fitControl)
## Error in train(fracinf ~ ., data = data_train, method = "lm", trControl = fitControl): could not find function "train"
print(linear_model)
## Error in print(linear_model): object 'linear_model' not found
regression_splines = train(fracinf ~ ., data = data_train, method = "earth", trControl = fitControl)
## Error in train(fracinf ~ ., data = data_train, method = "earth", trControl = fitControl): could not find function "train"
print(regression_splines)
## Error in print(regression_splines): object 'regression_splines' not found
knn_model = train(fracinf ~ ., data = data_train, method = "knn", trControl = fitControl)
## Error in train(fracinf ~ ., data = data_train, method = "knn", trControl = fitControl): could not find function "train"
print(knn_model)
## Error in print(knn_model): object 'knn_model' not found
#report the RMSE for each method. Note that knn and earth perform some model tuning (we'll discuss this soon) and report multiple RMSE. Use the lowest value.

Linear: 25.08414 Regrssion Spines: 25.22186, 13.20065, 13.18253 K-Nearest Neighbor: 9.917903, 20.454868, 11.093563

So we find that some of these models do better than the null model and the single-predictor ones. KNN seems the best of those 3. Next, we want to see if pre-processing our data a bit more might lead to even better results.

Multi-predictor models with pre-processing

Above, we fit outcome and predictors without doing anything to them. Let’s see if some further processing improves the performance of our multi-predictor models.

First, we look at near-zero variance predictors. Those are predictors that have very little variation. For instance, for a categorical predictor, if 99% of the values are a single category, it is likely not a useful predictor. A similar idea holds for continuous predictors. If they have very little spread, they might likely not contribute much ‘signal’ to our fitting and instead mainly contain noise. Some models, such as trees, which we’ll cover soon, can ignore useless predictors and just remove them. Other models, e.g., linear models, are generally performing better if we remove such useless predictors.

Note that in general, one should apply all these processing steps to the training data only. Otherwise, you would use information from the test set to decide on data manipulations for all data (called data leakage). It is a bit hard to say when to make the train/test split. Above, we did a good bit of cleaning on the full dataset before we split. One could argue that one should split right at the start, then do the cleaning. However, this doesn’t work for certain procedures (e.g., removing observations with NA).

#write code using the caret function `nearZeroVar` to look at potential uninformative predictors. Set saveMetrics to TRUE. Look at the results 

nzv = nzv <- nearZeroVar(data_train, saveMetrics= TRUE)
## Error in nearZeroVar(data_train, saveMetrics = TRUE): could not find function "nearZeroVar"
nzv[nzv$nzv,][1:10,]
## Error in eval(expr, envir, enclos): object 'nzv' not found
dim(data_train)
## Error in eval(expr, envir, enclos): object 'data_train' not found

You’ll see that several variables are flagged as having near-zero variance. Look for instance at Deaths, you’ll see that almost all outbreaks have zero deaths. It is a judgment call if we should remove all those flagged as near-zero-variance or not. For this exercise, we will.

#write code that removes all variables with near zero variance from the data 

nzv = nearZeroVar(data_train)
## Error in nearZeroVar(data_train): could not find function "nearZeroVar"
filtered_data_train = data_train[, -nzv]
## Error in eval(expr, envir, enclos): object 'data_train' not found
dim(filtered_data_train)
## Error in eval(expr, envir, enclos): object 'filtered_data_train' not found

You should be left with 13 variables (including the outcome).

Next, we noticed during our exploratory analysis that it might be useful to center and scale predictors. So let’s do that now. With caret, one can do that by providing the preProc setting inside the train function. Set it to center and scale the data, then run the 3 models from above again.

#write code that repeats the multi-predictor fits from above, but this time applies centering and scaling of variables.

set.seed(1111) 

fitControl = trainControl(method="repeatedcv",number=5,repeats=5) #setting CV method for caret
## Error in trainControl(method = "repeatedcv", number = 5, repeats = 5): could not find function "trainControl"
linear_model2 = train(fracinf ~ ., data = filtered_data_train, method = "lm", trControl = fitControl, preProc = c("center", "scale"))
## Error in train(fracinf ~ ., data = filtered_data_train, method = "lm", : could not find function "train"
print(linear_model2)
## Error in print(linear_model2): object 'linear_model2' not found
regression_splines2 = train(fracinf ~ ., data = filtered_data_train, method = "earth", trControl = fitControl, preProc = c("center", "scale"))
## Error in train(fracinf ~ ., data = filtered_data_train, method = "earth", : could not find function "train"
print(regression_splines2)
## Error in print(regression_splines2): object 'regression_splines2' not found
knn_model2 = train(fracinf ~ ., data = filtered_data_train, method = "knn", trControl = fitControl, preProc = c("center", "scale"))
## Error in train(fracinf ~ ., data = filtered_data_train, method = "knn", : could not find function "train"
print(knn_model2)
## Error in print(knn_model2): object 'knn_model2' not found
#look at the RMSE for the new fits

So it looks like the linear mode got a bit better, KNN actually got worse, and MARS didn’t change much. Since for KNN, “the data is the model”, removing some predictors might have had a detrimental impact. Though to say something more useful, I would want to look much closer into what’s going on and if these pre-processing steps are useful or not. For this exercise, let’s move on.

Model uncertainty

We can look at the uncertainty in model performance, e.g., the RMSE. Let’s look at it for the models fit to the un-processed data.

#Use the `resamples` function in caret to extract uncertainty from the 3 models fit to the data  that doesn't have predictor pre-processing, then plot it

resamps = resamples(list(lm = linear_model,
                          rsplines = regression_splines,
                          knn = knn_model))
## Error in resamples(list(lm = linear_model, rsplines = regression_splines, : could not find function "resamples"
summary(resamps)
## Error in summary(resamps): object 'resamps' not found
splom(resamps)
## Error in splom(resamps): object 'resamps' not found

It seems that the model uncertainty for the outcome is fairly narrow for all models. We can (and in a real setting should) do further explorations to decide which model to choose. This is based part on what the model results are, and part on what we want. If we want a very simple, interpretable model, we’d likely use the linear model. If we want a model that has better performance, we might use MARS or - with the un-processed dataset - KNN.

Residual plots

For this exercise, let’s just pick one model. We’ll go with the best performing one, namely KNN (fit to non-pre-processed data). Let’s take a look at the residual plot.

#Write code to get model predictions for the outcome on the training data, and plot it as function of actual outcome values.

outcome_knn = predict.train(knn_model)
## Error in predict.train(knn_model): could not find function "predict.train"
ggplot()+ 
  geom_point(aes(x = outcome_knn, y = data_train$fracinf)) 
## Error in FUN(X[[i]], ...): object 'outcome_knn' not found
#also compute residuals (the difference between prediction and actual outcome) and plot that

ggplot()+ 
  geom_point(aes(x = outcome_knn, y = data_train$fracinf - outcome_knn)) + 
  ggtitle("Residuals")
## Error in FUN(X[[i]], ...): object 'outcome_knn' not found

Both plots look ok, predicted vs. outcome is along the 45-degree line, and the residual plot shows no major pattern. Of course, for a real analysis, we would again want to dig a bit deeper. But we’ll leave it at this for now.

Final model evaluation

Let’s do a final check, evaluate the performance of our final model on the test set.

#Write code that computes model predictions and for test data, then compute SSR and RMSE.

test_predictions = predict(knn_model, data_test)
## Error in predict(knn_model, data_test): object 'knn_model' not found
#Sum of Squared Residuals (SSR) is the sum of the squares of residuals (deviations predicted from actual data)

#Calculate the residuals
test_residuals_sq = (test_predictions - data_test$fracinf)^2
## Error in eval(expr, envir, enclos): object 'test_predictions' not found
ssr = sum(test_residuals_sq)
## Error in eval(expr, envir, enclos): object 'test_residuals_sq' not found
#RMSE is the standard deviation of the residuals
rmse(data_test$fracinf, test_predictions)
## Error in rmse(data_test$fracinf, test_predictions): could not find function "rmse"
ggplot() + 
  geom_point(aes(x = data_test$fracinf, y = test_predictions))
## Error in FUN(X[[i]], ...): object 'data_test' not found

SSR: 16443.44 RMSE: 10.01324

Since we have a different number of observations, the result isn’t expected to be quite the same as for the training data (despite dividing by sample size to account for that). But it’s fairly close, and surprisingly not actually worse. So the KNN model seems to be reasonable at predicting. Now if its performance is ‘good enough’ is a scientific question.

We will leave it at this, for now, we will likely (re)visit some other topics soon as we perform more such analysis exercises in upcoming weeks. But you are welcome to keep exploring this dataset and try some of the other bits and pieces we covered.